Project 3: Data Analysis with R

Udacity Nanodegree - November Cohort

Loan Data from Prosper Exploration by Michael Strobl

Here you can download the dataset: Loan Data from Prosper and Explanations to the Dataset
Prosper is a platform where individuals can invest in personal loans or request to borrow money.

Here is a YouTube video from Fox Business in which the CEO of Prosper explains the Prosper system.

Dataset: Read in Dataset and Libraries:

loan <- read.csv('prosperLoanData.csv')
library(ggplot2)
library(gridExtra)
library(RColorBrewer)


Note: The dataset ‘prosperLoanData.csv’ must be in the same folder as this ‘Project_3.Rmd’ file.

1 Univariate Plots Section

In the following, 19 variables of the dataset are plotted and described. They are divided into numeric and categorical data. The numeric data is also described with an R summary command.

Numeric Data


Variable 1: Loan Original Amount


##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

The loans range from 1000 to 35000 USD. Most loans lie between 5000 and 15000 USD. You can also see peaks at exactly 5000, 10000, 15000, 20000 and 25000 USD.

Variable 2: Borrower Annual Percentage Rate (APR)


Note: To keep outliers out of the plot, the data was restricted to the range between the 0.1% and 99.9% quantiles.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00653 0.15630 0.20980 0.21880 0.28380 0.51230      25

Most Borrower APRs lie between 15% and 30%, although the single highest peak is at rates around 36%.
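
The quantile-based trimming mentioned in the note can be sketched as follows. The plotting code of the report is not shown, so this is only one plausible way to do it; the helper `trim_quantiles` and the toy vector are hypothetical.

```r
# Hypothetical helper: keep only values between two quantiles.
# NA values are dropped, since they cannot be compared or plotted.
trim_quantiles <- function(x, lower = 0.001, upper = 0.999) {
  limits <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  x[!is.na(x) & x >= limits[1] & x <= limits[2]]
}

# Toy data standing in for loan$BorrowerAPR (not the real values):
# 1000 plausible rates plus one extreme value and one NA.
set.seed(1)
apr <- c(runif(1000, min = 0.05, max = 0.40), 5, NA)

trimmed <- trim_quantiles(apr)
range(trimmed)   # the extreme value 5 is no longer included
```

Trimming only for the plot leaves the summary statistics untouched, which matches the report: the summaries still show the full minimum and maximum.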

Variable 3: Borrower Rate


Note: To keep outliers out of the plot, the data was restricted to the range between the 0.1% and 99.9% quantiles.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975

The most frequent rates are around 32%, followed by rates around 15% and 20%.

Variable 4: Lender Yield


Lender Yield is the Borrower Rate less Service Fees.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.0100  0.1242  0.1730  0.1827  0.2400  0.4925

The Lender Yield can be negative and the highest frequency is around 30%, followed by 16%, 15% and 22%.

Variable 5: Investors


Note: To keep outliers out of the plot, the data was restricted to the range between the 0% and 99% quantiles.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00   44.00   80.48  115.00 1189.00

The distribution is long-tailed. Most loans have 1 investor, and the number of loans decreases as the number of investors grows.

Variable 6: Monthly Loan Payment


Note: To keep outliers out of the plot, the data was restricted to the range between the 0% and 99% quantiles.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   131.6   217.7   272.5   371.6  2252.0

Most monthly loan payments are around 200 USD.

Variable 7: Stated Monthly Income


Note: To keep outliers out of the plot, the data was restricted to incomes from 0 to 20000 USD.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750000

Most borrowers have an income between 3000 and 6000 USD.

Variable 8: Debt To Income Ratio


Note: To keep outliers out of the plot, the data was restricted to ratios from 0 to 1 (that is, 0% to 100%).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

Most borrowers have a Debt to Income Ratio of 10% to 30%.

Variable 9: Average Credit Score


The Average Credit Score is the midpoint (mean) of the upper and lower bounds of the credit score range provided by a consumer credit rating agency.

#AverageCreditScore
loan$AverageCreditScore <- (loan$CreditScoreRangeLower+
                              loan$CreditScoreRangeUpper)/2


Note: To keep outliers out of the plot, the data was restricted to the range between the 1% and 99% quantiles.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     9.5   669.5   689.5   695.1   729.5   889.5     591

Most loans have a score between 650 and 750 points.

Variable 10: Loan Per Investor



#LoanPerInvestor
Investors2 <- subset(loan, Investors > 1)
Investors2$LoanPerInvestor <- Investors2$LoanOriginalAmount /
    Investors2$Investors



##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     7.299    48.190    70.000   220.200   125.000 12500.000

In most loans, each investor contributes 20 to 100 USD on average. 50 USD is the absolute peak.

Variable 11: Debt Service Coverage Ratio/Debt Coverage Ratio


Explanation of the Debt Coverage Ratio
It is the ratio of Stated Monthly Income to Monthly Loan Payment.

#Debt Coverage Ratio
loan$DebtCoverageRatio2 <- loan$StatedMonthlyIncome / loan$MonthlyLoanPayment


Note: To keep outliers out of the plot, the data was restricted to ratios from 0 to 100.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0      13      20     Inf      35     Inf      15

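Most borrowers have a Debt Coverage Ratio of around 20. That means their income is 20 times their monthly loan payment. The higher the value, the more trustworthy the borrower. The mean and maximum are Inf because some loans have a Monthly Loan Payment of 0, which makes the ratio a division by zero.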
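
The `Inf` values in the summary come from loans whose monthly payment is 0: R evaluates a positive number divided by 0 as `Inf` and 0/0 as `NaN`. Below is a minimal sketch of one way to guard against this, using a small invented data frame in place of `loan`. Whether such rows should become `NA` or be dropped is a judgment call that the report does not specify.

```r
# Toy data frame with the two columns used for the ratio (values invented).
toy <- data.frame(
  StatedMonthlyIncome = c(4000, 5000, 3000, 0),
  MonthlyLoanPayment  = c(200,  0,    150,  0)
)

# Plain division reproduces the problem: 5000/0 is Inf, 0/0 is NaN.
raw_ratio <- toy$StatedMonthlyIncome / toy$MonthlyLoanPayment

# Guarded version: treat the ratio as missing when the payment is 0.
toy$DebtCoverageRatio <- ifelse(toy$MonthlyLoanPayment > 0,
                                toy$StatedMonthlyIncome / toy$MonthlyLoanPayment,
                                NA_real_)

mean(raw_ratio)                             # not finite because of Inf and NaN
mean(toy$DebtCoverageRatio, na.rm = TRUE)   # mean of the two valid ratios
```

With the guarded column, `summary()` reports a finite mean and lists the zero-payment loans under `NA's`.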

Categorical Data


Variable 12: Term



The loans have three possible terms: 12, 36, 60 months. 77% of the loans have a term of 36 months.
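
A share such as the 77% above can be computed with `table()` and `prop.table()`. The sketch below uses an invented vector rather than the real `loan$Term`, so the percentages it prints are illustrative only.

```r
# Toy Term vector (invented counts, chosen only to illustrate the calls).
term <- c(rep(12, 2), rep(36, 15), rep(60, 3))

# Absolute counts per term, then shares in percent.
term_counts <- table(term)
term_share  <- round(100 * prop.table(term_counts), 1)
term_share
```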

Variable 13: Loan Status



Most loans are current, followed by completed, charged-off and defaulted loans.

Variable 14: Employment Status



Most loans are given to employed, full-time and self-employed people.

Variable 15: Loan Categories


Note: To avoid overplotting, the dataset was reduced to the top 10 Loan Categories.

Among the defined loan categories (not NA or Other), the most frequent are debt consolidation, home improvement, business and auto.
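
Reducing a categorical variable to its most frequent levels can be sketched as below. The helper `top_n_levels` is hypothetical and the category vector is invented; the report's own reduction code is not shown.

```r
# Toy category vector (invented); the real column would be loan$ListingCategory.
category <- c(rep("Debt Consolidation", 6), rep("Other", 4),
              rep("Auto", 3), rep("Business", 2), "Boat")

# Hypothetical helper: count each level, sort descending, keep the top n names.
top_n_levels <- function(x, n = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  head(names(counts), n)
}

top3    <- top_n_levels(category, n = 3)
reduced <- category[category %in% top3]
table(reduced)
```

The same helper would apply to the occupation plot further down, which is also limited to the top 10 levels.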

Variable 16: Home Owner


## Warning in `[<-.factor`(`*tmp*`, loan$IsBorrowerHomeowner == FALSE, value
## = structure(c(2L, : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, loan$IsBorrowerHomeowner == TRUE, value =
## structure(c(2L, : invalid factor level, NA generated

Borrowers who own a home slightly outnumber those who do not.
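
The warnings above typically appear when a value is assigned into a factor that has no matching level. The code that triggered them is not shown, so the cause is a guess. One way to avoid this kind of warning is to build the labelled factor in a single step, sketched here on an invented logical vector:

```r
# Toy logical column standing in for loan$IsBorrowerHomeowner.
is_homeowner <- c(TRUE, FALSE, TRUE, TRUE, NA)

# Declare levels and labels up front instead of overwriting values later.
homeowner <- factor(is_homeowner,
                    levels = c(FALSE, TRUE),
                    labels = c("No", "Yes"))

table(homeowner, useNA = "ifany")
```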

Variable 17: Top Occupations


Note: To avoid overplotting, the dataset was reduced to the top 10 Occupations.

The occupations with the highest frequency are administrative assistants, analysts and accountants/CPAs.


Variable 18: Year



The Year variable was created with the lubridate library.
Most loans were given in 2013, and there was a sharp drop in the number of loans in 2009.
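
Extracting a year from a timestamp can be sketched in base R as below; `lubridate::year()` on a parsed date would be the equivalent in the library named above. The report does not state which date column was used, so the timestamps here are invented stand-ins.

```r
# Toy timestamps in a date-time string format (values invented).
# Which date column the report used is not stated; this is an assumption.
origination <- c("2009-03-14 00:00:00", "2013-07-02 00:00:00",
                 "2013-11-20 00:00:00")

# Parse the date part and extract the year as an integer.
dates <- as.Date(origination, format = "%Y-%m-%d")
year  <- as.integer(format(dates, "%Y"))

table(year)   # loans per year
```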



What is the structure of your dataset?

Size of Dataset: 86.5 MB
Variables: 81
Number of Loans in the Dataset: about 114000
Volume of all Loans: about 950 Million
Number of Investors: about 9.2 Million
Average Investment: USD 103
Average Loan: USD 8300
Minimum Loan: USD 1000
Maximum Loan: USD 35000
Terms: 12, 36 or 60 months
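
These figures are consistent with one another: about 950 million divided by about 9.2 million gives roughly 103 USD per investment. The sketch below shows how such aggregates could be computed. It assumes that "Number of Investors" is the sum of the Investors column, and it uses invented values.

```r
# Toy data frame (invented values) with the two columns involved.
toy <- data.frame(
  LoanOriginalAmount = c(1000, 5000, 10000, 4000),
  Investors          = c(1,    50,   120,   29)
)

total_volume    <- sum(toy$LoanOriginalAmount)     # volume of all loans
total_investors <- sum(toy$Investors)              # investor participations
avg_investment  <- total_volume / total_investors  # average amount per investment
avg_loan        <- mean(toy$LoanOriginalAmount)    # average loan

c(volume = total_volume, investors = total_investors,
  avg_investment = avg_investment, avg_loan = avg_loan)
```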

What are the main features of interest in your dataset?

I want to see:
- How large are the loans people take out on the Prosper platform (main feature: LoanOriginalAmount)?
- How expensive are the loans (main feature: Borrower Rate)?

What other features in the dataset do you think will help support your investigation into your feature of interest?


- Who are the people who take the loans?
- What do people take the loans for?
- How many investors does a loan have?
- Is there a rating for the loans?

Did you create any new variables from existing variables in the dataset?

Average Credit Score
Loan Per Investor
Debt Coverage Ratio

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

BorrowerRate, BorrowerAPR and LenderYield are approximately normally distributed, and Investors has a long-tailed distribution. To tidy the data, I converted the variable “Term” to a factor and changed empty values to “NA”. ListingCategory..numeric. was recoded to ListingCategory, with each numeric code replaced by its meaning; for example, “1” becomes “Debt Consolidation”.
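
The recoding described above could look like the following sketch. Only the mapping of code 1 to "Debt Consolidation" is confirmed by the text. The labels for codes 2 and 3 are assumptions and should be checked against the dataset's variable definitions. The input vectors are invented.

```r
# Toy codes standing in for loan$ListingCategory..numeric.
codes <- c(1, 3, 2, 1, 1)

# Lookup from numeric code to label. Only "1" is confirmed by the text;
# the entries for 2 and 3 are assumptions to verify.
lookup <- c("1" = "Debt Consolidation",
            "2" = "Home Improvement",
            "3" = "Business")

listing_category <- factor(unname(lookup[as.character(codes)]),
                           levels = unname(lookup))

# Term is numeric in the raw data; as a factor it is treated as a category.
term <- factor(c(36, 60, 36, 12), levels = c(12, 36, 60))

table(listing_category)
levels(term)
```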

2 Bivariate Analysis


BorrowerRate vs BorrowerAPR vs LenderYield



Correlation:
a) BorrowerRate & BorrowerAPR

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and BorrowerAPR
## t = 2347.699, df = 113910, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9897057 0.9899409
## sample estimates:
##      cor 
## 0.989824


b) BorrowerRate & LenderYield

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and LenderYield
## t = 8493.938, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9992021 0.9992204
## sample estimates:
##       cor 
## 0.9992113


c) BorrowerAPR & LenderYield

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerAPR and LenderYield
## t = 2291.732, df = 113910, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9892049 0.9894515
## sample estimates:
##       cor 
## 0.9893289

Conclusion: The distributions of all three features are quite similar, and all three pairwise correlations are almost 1. That’s why the following analysis is reduced to one feature, Borrower Rate.
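
The outputs above have the form produced by `cor.test()`. As a sketch of the mechanics, the following uses invented data in which APR and yield are built from the rate plus small offsets. It is not the Prosper data and will not reproduce the values above.

```r
set.seed(42)
n <- 500

# Invented data: APR and yield are the rate plus a small offset and noise,
# mimicking the near-identical behaviour described above.
toy <- data.frame(BorrowerRate = runif(n, 0.05, 0.35))
toy$BorrowerAPR <- toy$BorrowerRate + rnorm(n, mean = 0.02, sd = 0.005)
toy$LenderYield <- toy$BorrowerRate - 0.01 + rnorm(n, sd = 0.001)

# All pairwise Pearson correlations in one call.
round(cor(toy, use = "pairwise.complete.obs"), 3)

# A single pair with test statistic and confidence interval.
result <- cor.test(toy$BorrowerRate, toy$LenderYield)
result$estimate
```

The correlation matrix gives the three pairwise values at a glance, while `cor.test()` adds the t statistic, degrees of freedom and confidence interval for one pair.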

1. Loan Original Amount

1.1 Borrower Rate vs Loan Original Amount


## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and LoanOriginalAmount
## t = -117.5822, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3341283 -0.3237719
## sample estimates:
##        cor 
## -0.3289599

Note: The red dot shows the mean of the plotted feature(s)
Most loans have an amount of 5000, 10000, 15000 or 20000 USD, and the Borrower Rate spreads widely between 5% and 35%. The value of -33% indicates a weak negative correlation between Loan Original Amount and Borrower Rate.

1.2 Investors vs Loan Original Amount

## 
##  Pearson's product-moment correlation
## 
## data:  Investors and LoanOriginalAmount
## t = 138.7077, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3751140 0.3850494
## sample estimates:
##       cor 
## 0.3800926


Note: The red dot shows the mean of the plotted feature(s)
The loans around 25000 USD have the most investors, but there are also many high loans with only 1 investor.
The value of 38% indicates a weak positive correlation between Investors and Loan Original Amount.

In the following, only loans with more than 1 investor are analysed.

## 
##  Pearson's product-moment correlation
## 
## data:  Investors and LoanOriginalAmount
## t = 263.0167, df = 86121, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6636999 0.6711074
## sample estimates:
##       cor 
## 0.6674202

The value rises from 38% to 67% and indicates a high positive correlation between Loan Original Amount and Investors.


1.3 Monthly Loan Payment vs Loan Original Amount

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and MonthlyLoanPayment
## t = 867.8179, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9312165 0.9327426
## sample estimates:
##       cor 
## 0.9319837

You can see three distinct lines, which represent the three different terms. The higher the loan, the higher the Monthly Loan Payment.
The Pearson correlation test confirms this with a very strong correlation of 93%.
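
The three lines are what the standard annuity formula for a fixed-rate loan would predict: for a given term and rate, the monthly payment is proportional to the loan amount. The sketch below assumes that the loans are amortised this way, which the report does not state. The function name `monthly_payment` and the 20% rate are illustrative.

```r
# Standard annuity payment: P * r / (1 - (1 + r)^-n),
# with P the principal, r the monthly rate and n the number of months.
monthly_payment <- function(principal, annual_rate, term_months) {
  r <- annual_rate / 12
  principal * r / (1 - (1 + r)^(-term_months))
}

# Payment for a 10000 USD loan at an illustrative 20% annual rate.
terms    <- c(12, 36, 60)
payments <- sapply(terms, function(n) monthly_payment(10000, 0.20, n))
names(payments) <- paste(terms, "months")
round(payments, 2)
```

Because the rate also varies between loans, the real points scatter around each line rather than falling exactly on it.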

1.4 Average Credit Score vs Loan Original Amount



Note: The red dot shows the mean of the plotted feature(s)
There is a huge spread between credit scores of 500 and 800. Loans over 10000 USD mostly go with a credit score over 700. Loans under 10000 USD are mostly granted with a credit score of 500 or higher.

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and AverageCreditScore
## t = 122.0719, df = 113344, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3357190 0.3460095
## sample estimates:
##       cor 
## 0.3408745

A value of 34% shows a weak positive correlation between Average Credit Score and Loan Original Amount.


1.5 Stated Monthly Income vs Loan Original Amount



People mostly have higher loans when they have a higher income, but there are also exceptions (see bottom right).

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and StatedMonthlyIncome
## t = 69.3527, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1956816 0.2068243
## sample estimates:
##       cor 
## 0.2012595

A value of 20% shows a weak positive correlation between Stated Monthly Income and Loan Original Amount.


1.6 Debt To Income Ratio vs Loan Original Amount



Note: The red dot shows the mean of the plotted feature(s)
It seems that the Debt To Income Ratio has no visible effect on the Loan Original Amount.

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and DebtToIncomeRatio
## t = 3.2828, df = 105381, p-value = 0.001028
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.004074882 0.016148830
## sample estimates:
##        cor 
## 0.01011222

The value of 0.01 confirms this and shows almost no correlation between the two features.


1.7 Loan Per Investor vs Loan Original Amount



Most loans between 0 and 25000 USD are funded by individual investments of 50 to 100 USD.

## 
##  Pearson's product-moment correlation
## 
## data:  LoanPerInvestor and LoanOriginalAmount
## t = 35.555, df = 86121, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1136895 0.1268536
## sample estimates:
##       cor 
## 0.1202769

A value of 12% shows a weak positive correlation between these two features.


1.8 Debt Coverage Ratio vs Loan Original Amount



Note: The red dot shows the mean of the plotted feature(s)
Top left shows high loans with low Debt Coverage Ratio while bottom right shows low loans with high Debt Coverage Ratio.

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and DebtCoverageRatio2
## t = -29.1052, df = 113000, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.09204415 -0.08046991
## sample estimates:
##         cor 
## -0.08625994


A value of -9% shows a weak negative correlation between these two features.




1.9 Term vs Loan Original Amount


Note: The red dot shows the mean of the plotted feature(s)
The longer the term, the higher the Loan Original Amount.
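
The red dots mark group means. A per-group mean of this kind can be computed with `tapply()`, sketched here on invented data rather than the real loans:

```r
# Toy data (invented): loan amounts for the three possible terms.
toy <- data.frame(
  Term = factor(c(12, 12, 36, 36, 36, 60, 60)),
  LoanOriginalAmount = c(2000, 4000, 5000, 8000, 11000, 10000, 14000)
)

# Mean and median loan amount per term.
mean_by_term   <- tapply(toy$LoanOriginalAmount, toy$Term, mean)
median_by_term <- tapply(toy$LoanOriginalAmount, toy$Term, median)
rbind(mean = mean_by_term, median = median_by_term)
```

Showing the median next to the mean is a design choice: with skewed amounts, the two can differ noticeably.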

1.10 Loan Status vs Loan Original Amount



Note: The red dot shows the mean of the plotted feature(s)
Current loans have the highest loan amounts, while cancelled loans have the lowest.

1.11 Employment Status vs Loan Original Amount



Note: The red dot shows the mean of the plotted feature(s)
Employed and Full-time borrowers have the highest loans, while loans to retired and not-employed borrowers rarely exceed 10000 USD.

1.12 Loan Categories vs Loan Original Amount



Note: The red dot shows the mean of the plotted feature(s)
Debt consolidation, business and baby & adoption loans have the highest amounts.

1.13 Homeowner vs Loan Original Amount


## Warning in `[<-.factor`(`*tmp*`, loan$IsBorrowerHomeowner == FALSE, value
## = structure(c(1L, : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, loan$IsBorrowerHomeowner == TRUE, value =
## structure(c(1L, : invalid factor level, NA generated

Note: The red dot shows the mean of the plotted feature(s)
Homeowners have higher loan amounts than non-homeowners.

1.14 Top Occupations vs Loan Original Amount



Note: The red dot shows the mean of the plotted feature(s)
Attorneys have the highest loans and Bus Drivers the lowest.

1.15 Year vs Loan Original Amount



Note: The red dot shows the mean of the plotted feature(s)
The years 2013 and 2014 have the highest loan amounts, while 2008 has the lowest.


2. BorrowerRate


2.1 Investors vs Borrower Rate


Note: The red dot shows the mean of the plotted feature(s)
Most loans have rates between 5% and 35%, while the loans with the most investors have rates of 10% to 20%.

2.2 Monthly Loan Payment vs Borrower Rate


Note: The red dot shows the mean of the plotted feature(s)
Most payments are between 0 and 600 USD and have rates between 5% and 35%.

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and MonthlyLoanPayment
## t = -85.2021, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2501933 -0.2392759
## sample estimates:
##        cor 
## -0.2447424

A value of -24% shows a weak negative correlation between Borrower Rate and Monthly Loan Payment.

2.3 Average Credit Score vs Borrower Rate



Note: The red dot shows the mean of the plotted feature(s)
Scores from 500 to 650 seem to have similar rates around 25%. From 650 to 900 the rates fall from 25% to 10%.

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and AverageCreditScore
## t = -175.1695, df = 113344, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4661358 -0.4569730
## sample estimates:
##        cor 
## -0.4615667

A value of -46% shows a moderate negative correlation between Borrower Rate and Average Credit Score.

2.4 Stated Monthly Income vs Borrower Rate



The Monthly Income doesn’t seem to have a large effect on the Borrower Rate.

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and StatedMonthlyIncome
## t = -30.1548, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.09473938 -0.08321827
## sample estimates:
##        cor 
## -0.0889818

A value of -9% shows a weak negative correlation between Stated Monthly Income and Borrower Rate.


2.5 Debt To Income Ratio vs Borrower Rate



Note: The red dot shows the mean of the plotted feature(s)
The Debt to Income Ratio doesn’t seem to have a huge effect on the Borrower Rate.

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and DebtToIncomeRatio
## t = 20.4649, df = 105381, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.05690080 0.06892819
## sample estimates:
##        cor 
## 0.06291678

The value of 0.06 shows a weak positive correlation between these two features.


2.6 Loan Per Investor vs Borrower Rate



It seems that Loan per Investor has no visible effect on the Borrower Rate.

## 
##  Pearson's product-moment correlation
## 
## data:  LoanPerInvestor and BorrowerRate
## t = 20.6332, df = 86121, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06348697 0.07677859
## sample estimates:
##        cor 
## 0.07013589

The value of 0.07 shows a weak positive correlation between these two features.


2.7 Debt Coverage Ratio vs Borrower Rate



Note: The red dot shows the mean of the plotted feature(s)
Across all Debt Coverage Ratios, the Borrower Rates seem to be evenly distributed between 5% and 35%.

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and DebtCoverageRatio2
## t = 0.4327, df = 113000, p-value = 0.6652
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.004543397  0.007117585
## sample estimates:
##         cor 
## 0.001287138

The value of 0.13% shows almost no correlation between Borrower Rate and Debt Coverage Ratio.


2.8 Term vs Borrower Rate


Note: The red dot shows the mean of the plotted feature(s)
Terms of 36 or 60 months have higher rates than ones of 12 months.

2.9 Loan Status vs Borrower Rate



Note: The red dot shows the mean of the plotted feature(s)
Current and cancelled loans have the lowest borrower rates, while past-due, defaulted and charged-off loans have the highest.

2.10 Employment Status vs Borrower Rate



Note: The red dot shows the mean of the plotted feature(s)
Not-employed and Other have the highest rates while the other categories seem to have rates around 20%.

2.11 Loan Categories vs Borrower Rates



Note: The red dot shows the mean of the plotted feature(s)
Auto and Other have higher rates than the others. Boat and Debt Consolidation have the lowest.

2.12 Homeowner vs Borrower Rates


## Warning in `[<-.factor`(`*tmp*`, loan$IsBorrowerHomeowner == FALSE, value
## = structure(c(1L, : invalid factor level, NA generated
## Warning in `[<-.factor`(`*tmp*`, loan$IsBorrowerHomeowner == TRUE, value =
## structure(c(1L, : invalid factor level, NA generated

Note: The red dot shows the mean of the plotted feature(s)
Homeowners have lower rates than non-homeowners.

2.13 Top Occupations vs Borrower Rates



Note: The red dot shows the mean of the plotted feature(s)
Administrative Assistants and Bus Drivers have the highest interest rates, while Attorneys and Architects have the lowest.

2.14 Year vs Borrower Rates



Note: The red dot shows the mean of the plotted feature(s)
The interest rates are below 20% from 2005 to 2009, above 20% from 2010 to 2012 and again below 20% in 2013 and in 2014.

3. Additional bivariate Plots


3.1 Investors vs Monthly Loan Payment


## 
##  Pearson's product-moment correlation
## 
## data:  Investors and MonthlyLoanPayment
## t = 141.8441, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3824632 0.3923333
## sample estimates:
##       cor 
## 0.3874093

Note: The red dot shows the mean of the plotted feature(s)
Most loans have Monthly Loan Payments between 0 and 1000 USD and between 0 and 300 investors. The red dots show the means of each pair of Investors and Monthly Loan Payment. According to the Pearson correlation, there is a weak positive relationship between the two.

3.2 Investors vs Average Credit Score


## 
##  Pearson's product-moment correlation
## 
## data:  Investors and AverageCreditScore
## t = 94.9155, df = 113344, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2659485 0.2767345
## sample estimates:
##     cor 
## 0.27135


Most loans have Average Credit Scores between 600 and 800 points and between 0 and 500 investors. There is a weak positive correlation between Investors and Average Credit Score.

3.3 Monthly Loan Payment vs Average Credit Score


## 
##  Pearson's product-moment correlation
## 
## data:  MonthlyLoanPayment and AverageCreditScore
## t = 102.9909, df = 113344, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2871995 0.2978465
## sample estimates:
##      cor 
## 0.292532

Note: The red dot shows the mean of the plotted feature(s)
Loans with Monthly Loan Payments between 600 and 800 USD have scores between 600 and 800.

3.4 Term vs Investors



Note: The red dot shows the mean of the plotted feature(s)
Loans with a term of 12 and 36 months have more investors than 60 month loans.

3.5 Loan Categories vs Investors



Note: The red dot shows the mean of the plotted feature(s)
Business and Personal Loans have the most investors while Boat, Baby & Adoption and Other have the fewest.

3.6 Year vs Monthly Loan Payment



Note: The red dot shows the mean of the plotted feature(s)
Monthly Loan Payments have risen over the years, except for a drop around 2009.

3.7 Top Occupations vs Monthly Loan Payment



Note: The red dot shows the mean of the plotted feature(s)
Attorneys have the highest monthly loan payments while Bus Drivers and Administrative Assistants have the lowest.

3.8 Employment Status vs Monthly Loan Payment



Note: The red dot shows the mean of the plotted feature(s)
Employed and Self Employed People have higher monthly loan payments than retired, part-time or not employed People.

3.9 Homeowner vs Average Credit Score



Note: The red dot shows the mean of the plotted feature(s)
Homeowners have higher Average Credit Scores than non-homeowners.

3.10 Loan Status vs Average Credit Score



Note: The red dot shows the mean of the plotted feature(s)
Current, Completed and Final Payment in Progress Loans have higher Scores than Cancelled or Defaulted Loans.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The two main features are Loan Original Amount and Borrower Rate. First, Loan Original Amount:
The features Investors (1.2), Average Credit Score (1.4) and Term (1.9) are positively related to the Loan Original Amount: higher amounts go along with more investors, higher Average Credit Scores or longer terms.
When you look at the feature Homeowner (1.13), you can see that people with homes have higher amounts than people without homes. This makes sense because they can offer more collateral.
The feature Occupation (1.14) is also interesting. Occupations that require higher education, like Attorneys and Chemists, get higher loans than Bus Drivers or Administrative Assistants.
Finally, the feature Year (1.15): Over time, people get higher loans. The Prosper platform seems to have become more trusted over the last 10 years, except in 2009.

Second, Borrower Rate:
Most features, such as Stated Monthly Income (2.4) or Term (2.8), have no direct effect on the Borrower Rate: the rates stay between 5% and 35% regardless of their value. Other features do have visible effects.
First, Average Credit Score (2.3): The higher the score, the lower the interest rate.
Secondly, Homeowners (2.12) have lower rates than non-homeowners.
Thirdly, Occupation: Attorneys have lower rates than Bus Drivers (see above, Loan Original Amount).
Finally, Year (2.14): Over time, the rates have decreased, with the exception of 2008/2009, probably because of the financial crisis.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Monthly Loan Payment vs Year (3.6) is interesting. It resembles Borrower Rate vs Year and Loan Original Amount vs Year (see above). Over time, people are paying more for the loans they take out on the Prosper platform. Perhaps they now use this platform more than other institutions such as banks.
Occupation vs Monthly Loan Payment (3.7): You can see that the Occupation has an effect on the Monthly Loan Payment. Attorneys and Architects can afford higher payments than the others.
Homeowners vs Average Credit Score (3.9): Homeowners have higher scores than non-homeowners. This makes sense because getting a mortgage for a house requires a high score in the first place.

What was the strongest relationship you found?


The strongest relationships were those between Borrower Rate, APR and Lender Yield at the beginning, with correlations of almost 1. The next strongest was the correlation between Monthly Loan Payment and Loan Original Amount, at 93%.


3 Multivariate Analysis


1. Loan Original Amount by Monthly Loan Payment and Term



This plot shows Loan Original Amount vs Monthly Loan Payment by Term. You can see three distinct lines, one for each of the three terms, which is consistent with the very strong correlation between Loan Original Amount and Monthly Loan Payment.
Correlations:
Loan Original Amount by Monthly Loan Payment
Term: 12 Months

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and MonthlyLoanPayment
## t = 90.8143, df = 1612, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9062526 0.9222398
## sample estimates:
##      cor 
## 0.914603


Term: 36 Months

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and MonthlyLoanPayment
## t = 1779.007, df = 87776, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9862350 0.9865921
## sample estimates:
##       cor 
## 0.9864147


Term: 60 Months

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and MonthlyLoanPayment
## t = 710.2177, df = 24543, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9759372 0.9770983
## sample estimates:
##       cor 
## 0.9765248

All three correlations are above 90%, and the 36-month term has the highest positive correlation between Monthly Loan Payment and Loan Original Amount.
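
Per-group correlations like the three above can be computed in one pass by splitting the data frame by Term. The sketch below uses invented data in which the payment is roughly proportional to the amount within each term. The numbers it prints are not those of the Prosper data.

```r
set.seed(7)

# Invented data: payment is roughly amount / term plus noise, per term.
toy <- data.frame(
  Term = rep(c(12, 36, 60), each = 100),
  LoanOriginalAmount = runif(300, 1000, 35000)
)
toy$MonthlyLoanPayment <- toy$LoanOriginalAmount / toy$Term * 1.3 +
  rnorm(300, sd = 20)

# Split the rows by term and compute one Pearson correlation per group.
cor_by_term <- sapply(split(toy, toy$Term), function(d) {
  cor(d$LoanOriginalAmount, d$MonthlyLoanPayment)
})
round(cor_by_term, 3)

# Pooled correlation across all terms, for comparison.
pooled <- cor(toy$LoanOriginalAmount, toy$MonthlyLoanPayment)
pooled
```

In this toy example the pooled value is lower than each per-term value, because mixing three different slopes blurs the linear relationship.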

2. Loan Original Amount by Borrower Rate and Homeowners



Homeowners (red, top left) seem to have higher loans with lower rates than non-homeowners (green, bottom right). One explanation could be the collateral the home offers.
Correlation between Loan Original Amount and Borrower Rate:
Homeowner: yes:

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and BorrowerRate
## t = -82.0419, df = 57476, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3310751 -0.3164386
## sample estimates:
##        cor 
## -0.3237762



Homeowner: no:

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and BorrowerRate
## t = -74.1159, df = 56457, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3052753 -0.2902407
## sample estimates:
##        cor 
## -0.2977764

The correlations of -32% and -30% are similar. Whether or not the borrower owns a home makes no big difference to the correlation between Loan Original Amount and Borrower Rate.

3. Borrower Rate by Investors and Homeowners



Non-homeowners (left, dark green) have fewer investors, while homeowners (bottom right, red) have more investors and mostly rates below 20%.


Correlation between Borrower Rate and Investors:
Homeowner: yes

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and Investors
## t = -70.3709, df = 57476, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2891552 -0.2741017
## sample estimates:
##        cor 
## -0.2816457



Homeowner: no

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and Investors
## t = -59.0455, df = 56457, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2489195 -0.2333816
## sample estimates:
##       cor 
## -0.241166


There is no big difference in the correlations between Borrower Rate and Investors when you compare borrowers with and without their own homes.

4. Loan Original Amount by Stated Monthly Income and Year




Note: The year 2005 is missing because there are only 22 observations for it in the dataset.
In the later years, borrowers seem to have higher Stated Monthly Incomes (right), while the Loan Original Amount seems similar.


Correlation between Loan Original Amount and Stated Monthly Income:
Years 2006 to 2009:

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and StatedMonthlyIncome
## t = 47.1695, df = 30963, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2485014 0.2692846
## sample estimates:
##      cor 
## 0.258923



Years 2010 to 2014:

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and StatedMonthlyIncome
## t = 53.2303, df = 82948, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1751564 0.1883172
## sample estimates:
##       cor 
## 0.1817449

There is a small difference between the two correlations. The correlation for the years 2006 to 2009 is a little higher (26%) than the one for the years 2010 to 2014 (18%).

5. Loan Original Amount by Average Credit Score and Employment Status


For clarity, the Employment Statuses are grouped into full-time and non-full-time employment.

Full-time employed people have much higher scores and loans than people who are not employed full-time.


Correlation between Loan Original Amount and Average Credit Score:
Full-time Employment (Employed, Full-time, Self-employed)

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and AverageCreditScore
## t = 112.863, df = 99809, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3309088 0.3419122
## sample estimates:
##      cor 
## 0.336422



Non-full-time Employment (Not available, Other, Part-time, Not employed, Retired)

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and AverageCreditScore
## t = 38.2238, df = 13533, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2968717 0.3272835
## sample estimates:
##       cor 
## 0.3121576

Both correlations have values around 0.3 (0.34 and 0.31).
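Grouping the Employment Statuses into the two classes listed above can be sketched like this. The status labels are taken from the text. The variable names `full_time_levels` and `group` are hypothetical, and the report may have done the grouping differently.

```r
# The eight Employment Status labels named in the text.
status <- c("Employed", "Full-time", "Self-employed", "Not available",
            "Other", "Part-time", "Not employed", "Retired")

# Labels that count as full-time employment in this analysis.
full_time_levels <- c("Employed", "Full-time", "Self-employed")

# Membership test turned into a two-level factor.
group <- factor(
  ifelse(status %in% full_time_levels, "Full-time", "No full-time"),
  levels = c("Full-time", "No full-time")
)
print(table(group))
```

The resulting factor can then be used to split the data before calling `cor.test`, or as a color variable in a plot.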

6. Loan Original Amount by Debt Coverage Ratio and Homeowners



You can see a difference between Homeowners and Non-Homeowners when you compare Loan Original Amount and Debt Coverage Ratio. The Homeowners (red) get higher loans than Non-Homeowners (green) at similar Debt Coverage Ratios.

Correlation between Loan Original Amount and Debt Coverage Ratio:
Homeowner: yes

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and DebtCoverageRatio2
## t = -24.8585, df = 57099, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.11157976 -0.09535107
## sample estimates:
##        cor 
## -0.1034723



Homeowner: no

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and DebtCoverageRatio2
## t = -18.2806, df = 55899, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08532444 -0.06884351
## sample estimates:
##         cor 
## -0.07708924


There is no big difference in the correlations between Loan Original Amount and Debt Coverage Ratio when you compare Borrowers with and without their own homes. Both are weakly negative (-0.10 and -0.08).

7. Loan Original Amount by Loan Per Investor and Year




Note: The Year 2005 is missing because there are only 22 observations in the dataset.
The Loans Per Investor have moved over time from an interval of 20 to 60 USD to an interval of 50 to 100 USD (from left to right in the plots). The Amount seems to be constant.

Correlation between Loan Original Amount and Loan Per Investor:
Years 2006 to 2009:

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and LoanPerInvestor
## t = 9.7733, df = 30963, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04434563 0.06655356
## sample estimates:
##        cor 
## 0.05545645



Years 2010 to 2014:

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and LoanPerInvestor
## t = 188.2829, df = 82948, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5424041 0.5519395
## sample estimates:
##       cor 
## 0.5471896

There is a huge difference between both correlations. While the earlier one from 2006 to 2009 is only 0.06, the later one from 2010 to 2014 has a much stronger value of 0.55.
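How far apart the two correlations are can be checked from the reported numbers alone. The sketch below rebuilds the 95% confidence intervals with the Fisher z-transformation, using only the `cor` and `df` values printed above. The function name `fisher_ci` is a hypothetical helper, not part of the report.

```r
# Rebuild the 95% confidence interval of a Pearson correlation from r and df.
fisher_ci <- function(r, df, level = 0.95) {
  n <- df + 2                      # cor.test reports df = n - 2
  z <- atanh(r)                    # Fisher z-transformation
  se <- 1 / sqrt(n - 3)            # standard error on the z scale
  q <- qnorm(1 - (1 - level) / 2)  # normal quantile for the chosen level
  tanh(z + c(-1, 1) * q * se)      # back-transform to the correlation scale
}

ci_early <- fisher_ci(0.05545645, 30963)  # Years 2006 to 2009
ci_late  <- fisher_ci(0.5471896, 82948)   # Years 2010 to 2014
print(round(rbind(early = ci_early, late = ci_late), 4))
```

The intervals agree with the `cor.test` output above and lie far apart, which supports reading the two periods as clearly different.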

8. Borrower Rate by Loan Per Investor and Year




Note: The Year 2005 is missing because there are only 22 observations in the dataset.
The Borrower Rate seems to stay constant between 5% and 35% while the Loan Per Investor has risen in the later years.

Correlation between Borrower Rate and Loan Per Investor:
Years 2006 to 2009:

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and LoanPerInvestor
## t = -2.633, df = 30963, p-value = 0.008468
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.026095389 -0.003823944
## sample estimates:
##         cor 
## -0.01496152



Years 2010 to 2014:

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and LoanPerInvestor
## t = -79.2497, df = 82948, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2716199 -0.2589674
## sample estimates:
##        cor 
## -0.2653051

There is a huge difference between both correlations. While the earlier one from 2006 to 2009 has a value of almost 0 (-0.01), the later one has a value of -0.27, which is a weak negative correlation.

9. Loan Original Amount by Debt To Income Ratio and Top Occupations



Most people in the top Occupations have a Debt To Income Ratio between 0 and 0.5, except Administrative Assistants and Bus Drivers. These people also have lower Loan Amounts than the others.

Correlation between Loan Original Amount and Debt To Income Ratio: I divided the 10 Occupations into high and low Income Jobs.

High Income Jobs (Analyst, Accountant/CPA, Attorney)

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and DebtToIncomeRatio
## t = 5.3086, df = 7431, p-value = 1.137e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.03878575 0.08408237
## sample estimates:
##        cor 
## 0.06146571



Low Income Jobs (Administrative Assistant, Civil Service, Bus Driver, Architect, Car Dealer, Chemist, Biologist)

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and DebtToIncomeRatio
## t = 1.6699, df = 5869, p-value = 0.09499
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.003790085  0.047346546
## sample estimates:
##        cor 
## 0.02179248

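Both correlations are really low (0.06 and 0.02). The one for the low income jobs is not significantly different from 0 at the 5% level (p-value = 0.095).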
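The t statistics and p-values in the two outputs follow directly from the correlation and the degrees of freedom. As a minimal sketch, the helper below (hypothetical name `r_to_test`) recomputes them from the printed `cor` and `df` values.

```r
# Recompute the t statistic and two-sided p-value of a Pearson correlation.
r_to_test <- function(r, df) {
  t_stat <- r * sqrt(df) / sqrt(1 - r^2)  # t = r * sqrt(n - 2) / sqrt(1 - r^2)
  p_value <- 2 * pt(-abs(t_stat), df)     # two-sided p-value, t distribution
  c(t = t_stat, p = p_value)
}

high_income <- r_to_test(0.06146571, 7431)  # Analyst, Accountant/CPA, Attorney
low_income  <- r_to_test(0.02179248, 5869)  # remaining seven Occupations
print(signif(rbind(high_income, low_income), 4))
```

With several thousand observations even a correlation of 0.06 is statistically significant, so the size of the coefficient says more here than the p-value does.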

10. Borrower Rate by Stated Monthly Income and Loan Categories



The Borrower Rate doesn’t seem to differ visibly between the loan categories because all of them have rates between 5% and 35%. But you can see that people who use their loans for Debt Consolidation, Business and Home Improvements have the highest monthly Incomes.

Correlation between Borrower Rate and Stated Monthly Income: I divided the 10 Loan Categories into two groups of 5 categories each and computed one correlation per group.
First 5 Loan Categories (Debt Consolidation, NA, Other, Home Improvement, Business)

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and StatedMonthlyIncome
## t = -27.2829, df = 100387, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.09192925 -0.07964842
## sample estimates:
##         cor 
## -0.08579209



Last 5 Loan Categories (Auto, Personal Loan, Student Use, Baby & Adoption, Boat)

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and StatedMonthlyIncome
## t = -10.3106, df = 6005, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1566581 -0.1069589
## sample estimates:
##        cor 
## -0.1318914

The two correlations differ only slightly (-0.09 and -0.13).

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

First, focus on the feature Homeowner as color variable:
Plot 2 shows that Homeowners get better conditions than Non-Homeowners: higher amounts and lower interest rates.
Plot 3 shows that loans of Homeowners have more investors than loans of Non-Homeowners.
Plot 6 shows that Homeowners have higher Debt Coverage Ratios and higher loans than Non-Homeowners.
Secondly, the feature Year is interesting:
Plot 8 shows that Loan Per Investor has increased although the Borrower Rates seem to be constant.


Were there any interesting or surprising interactions between features?

Plot 9 is interesting. All of the 10 Occupations have similar Debt To Income Ratios, mostly between 0% and 50%. I expected people with higher incomes to have lower Debt To Income Ratios, but all people seem to have similar ratios of debt to income.




4 Final Plots and Summary


1. Loan Original Amount by Debt Coverage Ratio and Year





Note: The Year 2005 is missing because there are only 22 observations in the dataset.
You can see that the Debt Coverage Ratio changed over time. In the early years 2006-2009 (bottom left), the amounts are moderate and the Debt Coverage Ratio is low. There are also loans with high Debt Coverage Ratios and low amounts (bottom right). In the later years 2010-2014, there are many more loans with higher amounts and lower Debt Coverage Ratios (top left). You can see that more people are trusting the Prosper platform with higher amounts and that they are more trustworthy because of the higher ratios.

Correlation between Loan Original Amount and Debt Coverage Ratio:
Years 2006 to 2009:

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and DebtCoverageRatio2
## t = -14.3583, df = 30493, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.09308634 -0.07078963
## sample estimates:
##         cor 
## -0.08194824



Years 2010 to 2014:

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and DebtCoverageRatio2
## t = -26.8548, df = 82485, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.09985947 -0.08632921
## sample estimates:
##         cor 
## -0.09309864

There is no big difference between the correlations of the early and the later years (-0.08 and -0.09).

2. Loan Original Amount by Stated Monthly Income and Top Occupations




In these plots you can see how different occupations get different loan amounts backed by their incomes. There are three groups. Group 1 has top loans with top incomes: Analysts, Accountants and Attorneys. Group 2 has top loans with lower incomes: Civil Service, Car Dealers and Chemists. Group 3 has low loans and low incomes: Administrative Assistants, Bus Drivers, Architects and Biologists. Group 1 is the most preferred group, and that’s why they get high loans. Group 2 also has a good standing because they get high loans with lower incomes. But group 3 isn’t that interesting for lenders. As a result, they get lower loans than the other groups.

Correlation between Loan Original Amount and Stated Monthly Income: I divided the 10 Occupations into high and low Income Jobs.

High Income Jobs (Analyst, Accountant/CPA, Attorney)

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and StatedMonthlyIncome
## t = 26.3029, df = 7879, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2636895 0.3042837
## sample estimates:
##       cor 
## 0.2841139



Low Income Jobs (Administrative Assistant, Civil Service, Bus Driver, Architect, Car Dealer, Chemist, Biologist)

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and StatedMonthlyIncome
## t = 31.8228, df = 6122, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3550514 0.3980380
## sample estimates:
##       cor 
## 0.3767475

The low income jobs have a higher correlation (0.38) than the high income jobs (0.28) when you compare Loan Original Amount and Stated Monthly Income.
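According to the Reflection, the 10 Occupations are the ones with the highest counts. Selecting the most frequent levels of a categorical variable could be sketched as follows. The toy vector and the cut-off of 2 are invented for illustration, and the names `counts`, `top_n` and `keep` are hypothetical.

```r
# Toy vector standing in for the Occupation column; values are made up.
occupation <- c("Analyst", "Teacher", "Analyst", "Attorney", "Clerical",
                "Analyst", "Attorney", "Teacher", "Analyst", "Attorney")

# Count each level and sort from most to least frequent.
counts <- sort(table(occupation), decreasing = TRUE)

# Names of the most frequent levels (2 here; the report uses 10).
top_n <- names(head(counts, 2))

# Logical index of the rows that belong to one of the top levels.
keep <- occupation %in% top_n
print(counts)
print(top_n)
```

The logical vector `keep` can be used to subset the data frame before plotting or before calling `cor.test`.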

3. Loan Original Amount by Stated Monthly Income and Loan Categories




People with high incomes (right in the plots) mostly use their loans for Debt Consolidation, Home Improvements and Business. People with low incomes (left in the plots) mostly use them for the above-mentioned categories and also for Student Use, Baby & Adoption, Auto and Boats.

Correlation between Loan Original Amount and Stated Monthly Income: I divided the 10 Loan Categories into two groups of 5 categories each and computed one correlation per group.

First 5 Loan Categories (Debt Consolidation, NA, Other, Home Improvement, Business)

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and StatedMonthlyIncome
## t = 61.7547, df = 100387, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1853423 0.1972614
## sample estimates:
##       cor 
## 0.1913089



Last 5 Loan Categories (Auto, Personal Loan, Student Use, Baby & Adoption, Boat)

## 
##  Pearson's product-moment correlation
## 
## data:  LoanOriginalAmount and StatedMonthlyIncome
## t = 28.6632, df = 6005, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3244730 0.3689677
## sample estimates:
##       cor 
## 0.3469155

The second correlation (0.35) is higher than the first one (0.19).

5 Reflection


The most difficult part of this analysis was choosing the right variables. I reduced the dataset to 15 variables and created 3 new ones. I focused my analysis on the price of the loan (Borrower Rate) and the amount of the loan (Loan Original Amount). It’s interesting that only a few features have an influence on the interest rate, while the amount is much more influenced by features like Occupation, home ownership or the credit score of the Borrower.
The next step for me would be an analysis of all occupations. So far I have focused only on the Occupations with the 10 highest counts. But there are 68 different Occupations.
Beyond that, the platform Prosper has evolved over time. More loans were given in 2014 than in any other year, and the interest rates have been decreasing, too. Its users must have more and more trust in the platform.
The usage of the loans is also interesting. Most loans were used for Debt Consolidation and Business.
I am also a little bit surprised by the long terms (at least 12 months in this dataset) and the high rates (median 18.4%). If you take an example loan of 1000 USD over 36 months at 18.4%, with annual compounding and a single repayment at the end, you have to pay back about 1659.80 USD (1000 × 1.184^3; a rate of 18% would give 1643.03 USD). I would look for cheaper alternatives or shorter terms.
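The repayment figure can be checked with a short compound-interest sketch. It assumes annual compounding and a single repayment after three years. The text does not state the repayment scheme, so this is an assumption, and `compound_total` is a hypothetical helper name. Under that assumption the median rate of 18.4% gives about 1659.80 USD, while 1643.03 USD corresponds to a rate of 18%.

```r
# Total owed after compounding once per year (assumed repayment scheme).
compound_total <- function(principal, annual_rate, years) {
  principal * (1 + annual_rate)^years
}

total_median_rate  <- compound_total(1000, 0.184, 3)  # median Borrower Rate
total_rounded_rate <- compound_total(1000, 0.18, 3)   # rate rounded to 18%

print(round(c("18.4%" = total_median_rate, "18%" = total_rounded_rate), 2))
```

A loan repaid in monthly instalments would cost less in total than this end-of-term figure, because the outstanding balance shrinks with every payment.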

References


Dataset: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/prosperLoanData.csv
Explanation of Variables: https://docs.google.com/spreadsheets/d/1gDyi_L4UvIrLTEC6Wri5nbaMmkGmLQBk-Yx3z0XDEtI/edit#gid=0
http://en.wikipedia.org/wiki/Prosper_Marketplace
http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
https://www.youtube.com/watch?v=qz2ZV-ELVfw
http://www.r-bloggers.com/r-function-of-the-day-tapply
http://www.dummies.com/how-to/content/how-to-interpret-a-correlation-coefficient-r.html


Thank you for your attention.